{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 数据采集与分析工具\n",
    "\n",
    "钉钉群(建立课程钉钉群)\n",
    "\n",
    "学校慕课系统\n",
    "http://hznu.fanya.chaoxing.com/portal"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## python2.7 与 Python3 区别\n",
    "\n",
    "* print要加括号！\n",
    "\n",
    "https://www.cnblogs.com/big-devil/p/7625894.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n"
     ]
    }
   ],
   "source": [
    "print(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 新课程的内容\n",
    "\n",
    "* 大纲，实验计划\n",
    "\n",
    "* 数据分析 or 数据采集\n",
    "\n",
    "* Python的数据生态系统：numpy,scipy,pandas,beatifulsoup,scarpy,matplotlib,pyecharts...（见xmind）、\n",
    "\n",
    "Stay hungry，Stay foolish "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Windows 下 Python 安装与使用\n",
    "\n",
    "* winpython \n",
    "\n",
    "主页，我们的版本是WinPython 3.7.0.2-64bit，可以自己去主页上下载：\n",
    "https://winpython.github.io/\n",
    "\n",
    "也可以去我的百度云，64位下载地址：\n",
    "https://pan.baidu.com/s/1Yq2iwCMXN7jKYXIC_-6cTw\n",
    "\n",
    "* 双击安装，其中一些选项默认就好，但注意选择安装目录的时候需要安装在一个英文目录下面，目录中不要有中文"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Python基础知识回顾\n",
    "\n",
    "* 需要吗？\n",
    "* 复杂数据类型：列表，字典等\n",
    "* 程序流程控制：循环，条件，函数\n",
    "* 文件操作，计数的方法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9 [2, 1] 4\n"
     ]
    }
   ],
   "source": [
    "#列表\n",
    "a=[9,2,1,4]\n",
    "print(a[0],a[1:3],a[3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[9, 2, 1, 4, 4]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a.append(4)\n",
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[9, 2, 7, 1, 4, 4]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a.insert(2,7)\n",
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n",
      "dict_values([4, 1]) dict_keys([2, 5]) dict_items([(2, 4), (5, 1)])\n"
     ]
    }
   ],
   "source": [
    "d={2:4,5:1}#键值对\n",
    "print(d[5])\n",
    "print(d.values(),d.keys(),d.items())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4\n",
      "1\n"
     ]
    }
   ],
   "source": [
    "#print(list(d.values())[0])\n",
    "for i in d.values():\n",
    "    print(i)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "9\n",
      "8\n",
      "7\n",
      "6\n",
      "5\n"
     ]
    }
   ],
   "source": [
    "a=9\n",
    "while a>4:\n",
    "    print(a)\n",
    "    a=a-1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2\n",
      "3\n",
      "4\n"
     ]
    }
   ],
   "source": [
    "for i in range(2,5):\n",
    "    print(i)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "8"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d={1:4}\n",
    "d.get(2,8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{1: 4, 4: 3}\n"
     ]
    }
   ],
   "source": [
    "d[4]=3\n",
    "print(d)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{1: 4, 4: 3, 3: 7}\n"
     ]
    }
   ],
   "source": [
    "d.update({3:7})\n",
    "print(d)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7\n"
     ]
    }
   ],
   "source": [
    "print(d.get(3,2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{1: 4, 5: 2}\n"
     ]
    }
   ],
   "source": [
    "d={1:4}\n",
    "d.update({5:2})\n",
    "print(d)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{1: 4, 5: 2, 3: 6}\n"
     ]
    }
   ],
   "source": [
    "d[3]=6\n",
    "print(d)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "6\n"
     ]
    }
   ],
   "source": [
    "print(d.get(3,3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{2: 2, 4: 2, 1: 1}\n"
     ]
    }
   ],
   "source": [
    "a=[2,4,1,2,4]\n",
    "r={}\n",
    "for i in a:\n",
    "    r[i]=r.get(i,0)+1\n",
    "print(r)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10\n"
     ]
    }
   ],
   "source": [
    "a=[3,2,1,0,4]\n",
    "c=0\n",
    "for i in a:\n",
    "    c=c+i\n",
    "print(c)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# pandas\n",
    "\n",
    "* 最重要的基础数据分析处理包"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# pandas的数据类型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## series的创建\n",
    "\n",
    "* 无论是series还是dataframe都是在numpy的array基础上构建的\n",
    "* 序列的两个部分：元素与索引"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "c={\"i\":2,\"u\":3,3:6}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "i    2\n",
       "u    3\n",
       "1    6\n",
       "dtype: int64"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a=pd.Series([2,3,6],index=[\"i\",\"u\",1])#值必须类型一样\n",
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a[2]#双索引机制"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "i    8\n",
       "u    3\n",
       "1    6\n",
       "dtype: int64"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#赋值\n",
    "a[0]=8.32\n",
    "#a[0]=\"aaa\"\n",
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 164,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([8, 3, 6], dtype=int64)"
      ]
     },
     "execution_count": 164,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 165,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['i', 'u', 3], dtype='object')"
      ]
     },
     "execution_count": 165,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "w    2\n",
       "2    4\n",
       "3    3\n",
       "o    5\n",
       "r    6\n",
       "dtype: int64"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#z\n",
    "b=pd.Series({\"w\":2,2:4,3:3,\"o\":5,\"r\":6})#字典形式\n",
    "b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    2\n",
       "1    4\n",
       "2    3\n",
       "3    5\n",
       "4    6\n",
       "dtype: int64"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a=pd.Series([2,4,3,5,6])#会自动添加索引\n",
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 176,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "r    6\n",
       "o    5\n",
       "dtype: int64"
      ]
     },
     "execution_count": 176,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b=pd.Series({\"w\":2,2:4,3:3,\"o\":5,\"r\":6},index=[\"r\",\"o\"])#只会选取index中的索引\n",
    "b"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## dataframe的创建\n",
    "\n",
    "* 一个二维数组并且横向与纵向都有自定义索引\n",
    "* dataframe可以看成一系列有相同索引的series\n",
    "* 数据框的三个部分：元素，行标，列表\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.93058142,  0.77019753, -0.51476983],\n",
       "       [-1.00178584, -0.48270254,  0.73043323],\n",
       "       [ 0.54292453,  0.48071071, -1.58201189],\n",
       "       [ 1.57962945,  0.2586103 ,  0.36540852],\n",
       "       [ 0.47069817,  0.58140324,  0.90716359],\n",
       "       [-1.14667799, -2.17332435,  0.90351004]])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.random.randn(6,3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.271704</td>\n",
       "      <td>0.634301</td>\n",
       "      <td>0.405262</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.120751</td>\n",
       "      <td>-0.879684</td>\n",
       "      <td>-0.181473</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.953220</td>\n",
       "      <td>0.453240</td>\n",
       "      <td>0.149582</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.660946</td>\n",
       "      <td>0.461971</td>\n",
       "      <td>-0.560160</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0.152444</td>\n",
       "      <td>0.199540</td>\n",
       "      <td>-0.237531</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0.817927</td>\n",
       "      <td>-0.364992</td>\n",
       "      <td>-1.040131</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          a         b         c\n",
       "2  1.271704  0.634301  0.405262\n",
       "3 -0.120751 -0.879684 -0.181473\n",
       "4  0.953220  0.453240  0.149582\n",
       "5  0.660946  0.461971 -0.560160\n",
       "6  0.152444  0.199540 -0.237531\n",
       "7  0.817927 -0.364992 -1.040131"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a=pd.DataFrame(np.random.randn(6,3),index=range(2,8), columns=[\"a\",\"b\",\"c\"])#相对数组增加了横向和纵向的自定义索引，双索引机制\n",
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2    0.634301\n",
       "3   -0.879684\n",
       "4    0.453240\n",
       "5    0.461971\n",
       "6    0.199540\n",
       "7   -0.364992\n",
       "Name: b, dtype: float64"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a[\"b\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "#a[3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.05118313,  0.39638064,  0.38787328],\n",
       "       [ 0.55483877, -1.17624348,  0.8606569 ],\n",
       "       [-0.90502513,  2.06223449, -0.27556449],\n",
       "       [-0.45554163,  0.007331  , -0.89722409],\n",
       "       [-1.2040777 ,  0.20244219,  0.51471742],\n",
       "       [-0.44531058, -0.49707473, -1.94881836]])"
      ]
     },
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RangeIndex(start=2, stop=8, step=1)"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['a', 'b', 'c'], dtype='object')"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.664421</td>\n",
       "      <td>0.495775</td>\n",
       "      <td>-0.945343</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.443564</td>\n",
       "      <td>2.499683</td>\n",
       "      <td>0.145670</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-0.385458</td>\n",
       "      <td>-0.146777</td>\n",
       "      <td>0.397846</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.307054</td>\n",
       "      <td>0.404264</td>\n",
       "      <td>-0.754313</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.473609</td>\n",
       "      <td>-0.243507</td>\n",
       "      <td>-0.583231</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>-2.057494</td>\n",
       "      <td>-1.082490</td>\n",
       "      <td>0.376665</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          a         b         c\n",
       "0  1.664421  0.495775 -0.945343\n",
       "1 -0.443564  2.499683  0.145670\n",
       "2 -0.385458 -0.146777  0.397846\n",
       "3 -0.307054  0.404264 -0.754313\n",
       "4 -0.473609 -0.243507 -0.583231\n",
       "5 -2.057494 -1.082490  0.376665"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a=pd.DataFrame(np.random.randn(6,3), columns=[\"a\",\"b\",\"c\"])\n",
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.750190</td>\n",
       "      <td>-2.037230</td>\n",
       "      <td>1.342428</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.004734</td>\n",
       "      <td>-0.160271</td>\n",
       "      <td>-0.787664</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.166225</td>\n",
       "      <td>0.677395</td>\n",
       "      <td>1.055825</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.004122</td>\n",
       "      <td>0.418734</td>\n",
       "      <td>2.593051</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.641335</td>\n",
       "      <td>-0.828338</td>\n",
       "      <td>0.882715</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>-0.802709</td>\n",
       "      <td>0.019870</td>\n",
       "      <td>0.257846</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          0         1         2\n",
       "0  2.750190 -2.037230  1.342428\n",
       "1  0.004734 -0.160271 -0.787664\n",
       "2  0.166225  0.677395  1.055825\n",
       "3 -0.004122  0.418734  2.593051\n",
       "4 -0.641335 -0.828338  0.882715\n",
       "5 -0.802709  0.019870  0.257846"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a=pd.DataFrame(np.random.randn(6,3))#会自动把隐式索引添加上去\n",
    "a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a  b\n",
       "0  2  3\n",
       "1  3  4"
      ]
     },
     "execution_count": 101,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b=pd.DataFrame({\"a\":[2,3],\"b\":[3,4]})#通过字典定义dataframe\n",
    "b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>w</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>o</th>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>r</th>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0\n",
       "w  2\n",
       "2  4\n",
       "3  3\n",
       "o  5\n",
       "r  6"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b=pd.Series({\"w\":2,2:4,3:3,\"o\":5,\"r\":6})#通过序列生成数据框\n",
    "pd.DataFrame(b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>c1</th>\n",
       "      <th>c2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>w</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>o</th>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   c1  c2\n",
       "w   2   2\n",
       "o   5   5"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame({\"c1\":b,\"c2\":b},index=[\"w\",\"o\"])#通过series定义dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>w</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>o</th>\n",
       "      <th>r</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   w  2  3  o  r\n",
       "0  2  4  3  5  6"
      ]
     },
     "execution_count": 107,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame([b])#注意这里转置了"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 通过文件读入dataframe\n",
    "\n",
    "* 注意源文件是否有列标，没有的时候可以自己用names补充"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "r1=pd.read_csv(r\"D:\\t_alibaba_data3.txt\",names=[\"user\",\"brand\",\"behavr\",\"date\"],sep=\"\\t\",dtype={\"behavr\":int})\n",
    "#pandas会自己判断数据类型，但是有时也需要自己额外指定数据类型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user</th>\n",
       "      <th>brand</th>\n",
       "      <th>behavr</th>\n",
       "      <th>date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10944750</td>\n",
       "      <td>13451</td>\n",
       "      <td>0</td>\n",
       "      <td>06/04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>10944750</td>\n",
       "      <td>13451</td>\n",
       "      <td>2</td>\n",
       "      <td>06/04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10944750</td>\n",
       "      <td>13451</td>\n",
       "      <td>2</td>\n",
       "      <td>06/04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>10944750</td>\n",
       "      <td>13451</td>\n",
       "      <td>0</td>\n",
       "      <td>06/04</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>10944750</td>\n",
       "      <td>13451</td>\n",
       "      <td>0</td>\n",
       "      <td>06/04</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       user  brand  behavr   date\n",
       "0  10944750  13451       0  06/04\n",
       "1  10944750  13451       2  06/04\n",
       "2  10944750  13451       2  06/04\n",
       "3  10944750  13451       0  06/04\n",
       "4  10944750  13451       0  06/04"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "r1.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "#r1[\"date\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 索引\n",
    "\n",
    "* 元素所处的位置，和列表的位置，字典的键类似\n",
    "\n",
    "* 实现过滤的原理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 201,
   "metadata": {},
   "outputs": [],
   "source": [
    "a1=pd.Series([2,3,6,0],index=[7,9,3,2])\n",
    "b1=pd.Index([0,1,2,3])#也可以通过Index自定义索引"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 202,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Int64Index([7, 9, 3, 2], dtype='int64')"
      ]
     },
     "execution_count": 202,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a1.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 203,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Int64Index([0, 1, 2, 3], dtype='int64')"
      ]
     },
     "execution_count": 203,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 204,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Int64Index([3, 2], dtype='int64')"
      ]
     },
     "execution_count": 204,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a1.index&b1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 205,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Int64Index([0, 1, 2, 3, 7, 9], dtype='int64')"
      ]
     },
     "execution_count": 205,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a1.index|b1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Int64Index([0, 1, 7, 9], dtype='int64')"
      ]
     },
     "execution_count": 155,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a1.index^b1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3    6\n",
       "2    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 156,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a1[a1.index&b1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 208,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 208,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b1[0]\n",
    "#b1[0]=9#不可更改"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
